High performance implementation of the FFT butterfly computation

ABSTRACT

This invention is an FFT butterfly circuit. This circuit includes four temporary data registers connected to three memories. The three memories include read/write X and Y memories and a read only twiddle coefficient memory. A multiplier-accumulator forms a product and accumulates the product with one of two accumulator registers. A register file with plural registers is loaded from one of the accumulator registers or the fourth temporary data register. An adder/subtracter forms a selected one of a sum of registers or a difference of registers. A write buffer with two buffers temporarily stores data from the adder/subtracter before storage in the first or second memory. The X and Y memories must be read/write but the twiddle memory may be read only.

CLAIM OF PRIORITY

This application claims priority under 35 U.S.C. 119(e)(1) to U.S. Provisional Application No. 62/078,170 filed Nov. 11, 2014.

TECHNICAL FIELD OF THE INVENTION

The technical field of this invention is data processors for Fast Fourier Transform (FFT) butterfly computations.

BACKGROUND OF THE INVENTION

An FFT butterfly performs the following calculations:

T_(r) = (X_(2r) * W_(r)) − (X_(2i) * W_(i));

T_(i) = (X_(2i) * W_(r)) + (X_(2r) * W_(i));

X_(1r)′ = X_(1r) + T_(r);

X_(1i)′ = X_(1i) + T_(i);

X_(2r)′ = X_(1r) − T_(r);

X_(2i)′ = X_(1i) − T_(i);

where: X_(1r) and X_(2r) are respective first and second real coefficients; X_(1i) and X_(2i) are respective first and second imaginary coefficients; W_(r) is a real twiddle factor; W_(i) is an imaginary twiddle factor; T_(r) is a real temporary variable; T_(i) is an imaginary temporary variable; and X_(1r)′, X_(2r)′, X_(1i)′ and X_(2i)′ are respective updated coefficients. Using a single multiplier/accumulator and an adder, these calculations can be performed in 4 cycles per butterfly, one cycle for each of the four multiplications. An N-point radix-2 FFT contains (N/2)*log2(N) butterflies, so the theoretical minimum is 4*(N/2)*log2(N) cycles. Overhead is measured relative to this minimum.
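As a minimal sketch, the butterfly calculations and the minimum cycle count can be written out in Python. The function and variable names are illustrative and are not part of this disclosure. The sketch takes T_(i) as the sum (X_(2i)*W_(r)) + (X_(2r)*W_(i)), which is the imaginary part of the complex product of the second coefficient and the twiddle factor and is the quantity the accumulate step of cycle 5 forms in the detailed description. The helper min_cycles assumes a radix-2 FFT whose size N is a power of two.

```python
def butterfly(x1r, x1i, x2r, x2i, wr, wi):
    """Return (X1r', X1i', X2r', X2i') for one butterfly."""
    # Temporary complex product T = X2 * W, split into real and imaginary parts.
    tr = x2r * wr - x2i * wi
    ti = x2i * wr + x2r * wi
    # Sum and difference outputs.
    return x1r + tr, x1i + ti, x1r - tr, x1i - ti


def min_cycles(n):
    """Minimum cycles for an n-point radix-2 FFT at 4 cycles per butterfly.

    Assumes n is a power of two (an assumption of this sketch).
    """
    stages = n.bit_length() - 1          # log2(n) when n is a power of two
    if n < 2 or (1 << stages) != n:
        raise ValueError("n must be a power of two and at least 2")
    return 4 * (n // 2) * stages


# Cross-check against Python's built-in complex arithmetic.
x1, x2, w = complex(7, -2), complex(3, 5), complex(2, 3)
out = butterfly(7, -2, 3, 5, 2, 3)
assert complex(out[0], out[1]) == x1 + x2 * w
assert complex(out[2], out[3]) == x1 - x2 * w
```

For example, a 1024-point transform has 512 butterflies in each of 10 stages, so min_cycles(1024) evaluates to 4*512*10 = 20480.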

Existing designs, such as the Ceva TeakLite III DSP, have higher overhead partly because their memory bandwidth is only sufficient for the data reads and writes. They therefore need extra memory cycles to read new twiddle factors from memory.

SUMMARY OF THE INVENTION

This invention is an FFT butterfly circuit. This circuit includes four temporary data registers connected to three memories. The three memories include read/write X and Y memories and a read only twiddle coefficient memory. A multiplier-accumulator forms a product and accumulates the product with one of two accumulator registers. A register file with plural registers is loaded from one of the accumulator registers or the fourth temporary data register. An adder/subtracter forms a selected one of a sum of registers or a difference of registers. A write buffer with two buffers temporarily stores data from the adder/subtracter before storage in the first or second memory. The X and Y memories must be read/write but the twiddle memory may be read only.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other aspects of this invention are illustrated in the drawings, in which:

FIG. 1 illustrates the FFT butterfly circuit of this invention; and

FIG. 2 illustrates the sequence of operations using the hardware of FIG. 1 to calculate an FFT butterfly.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

This invention achieves low overhead in FFT butterfly calculations by employing the following. This invention uses both of the X and Y data/coefficient memories to provide the needed data bandwidth, using X for the real data and Y for the imaginary data. This invention uses a third Z read-only memory for the twiddle factor coefficients. The register configuration of this invention supports 4-cycle data flow. This invention employs two accumulators, an X_(T) register to save an X value for reuse, an M register to pipeline data to the register file and four general purpose registers. This invention employs a write buffer that allows output data writes to be aligned with the memory cycles not used for data reads.

FIG. 1 illustrates FFT butterfly circuit 100 of this invention. The X coefficients are stored in X memory 101. The Y coefficients are stored in Y memory 102. Memories 101 and 102 must be read/write memories (such as DRAM) to store the data input and the calculation results. The twiddle factors are stored in Z memory 103. These twiddle factors do not change. Therefore Z memory 103 need only be a read-only memory (such as ROM). Z memory 103 may also be read/write memory as used for memories 101 and 102.

There are four temporary data registers 104, 105, 106 and 107 connected as follows. X_(T) register 104 can be loaded from X register 105. Data stored in X_(T) register 104 can be transferred to X register 105. X register 105 can be loaded from X_(T) register 104, from data recalled from a particular address in X memory 101 and from data recalled from a particular address in Z memory 103. The output from X register 105 may be loaded into X_(T) register 104 and supplied to a first product input of multiply accumulator 108. Y register 106 can be loaded from data recalled from a particular address in Y memory 102 and from data recalled from a particular address in Z memory 103. The output from Y register 106 is supplied to a second product input of multiply accumulator 108. M register 107 can be loaded from data recalled from a particular address in X memory 101 and from data recalled from a particular address in Y memory 102. The output from M register 107 is stored in a specified register (R₀, R₁, R₂ or R₃) of register file 110.

Multiply accumulator 108 includes two product inputs connected to respective X register 105 and Y register 106. Multiply accumulator 108 includes a sum input from accumulator 109. Multiply accumulator 108 forms a product of the two product inputs and forms a selected addition or subtraction of this product with the sum input. The result is stored in accumulator 109 replacing the accumulator value supplied to the sum input. Thus the product is added to or subtracted from one of the accumulator contents and the sum is stored back in the same accumulator.

Accumulator 109 includes one input from multiply accumulator 108. Accumulator 109 stores this input into a selected one of Acc0 or Acc1. Accumulator 109 includes two outputs. Accumulator 109 may transfer data from a selected one of Acc0 or Acc1 to a specified register (R₀, R₁, R₂ or R₃) of register file 110. Accumulator 109 supplies data from a selected one of Acc0 or Acc1 to the sum input of multiply accumulator 108.
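The behavior of multiply accumulator 108 together with accumulator 109 can be sketched as a small Python model. The class name, method name and mode strings are hypothetical. The "load" mode, which stores the bare product without accumulation, is an assumption of this sketch: the cycle descriptions store a product directly into Acc0 or Acc1, but this disclosure does not say how the circuit achieves that.

```python
class MacUnit:
    """Hypothetical model of multiply accumulator 108 with accumulator 109."""

    def __init__(self):
        self.acc = [0, 0]  # Acc0 and Acc1

    def step(self, x, y, sel, mode):
        """Multiply x by y and update the accumulator register chosen by sel.

        mode "load" stores the product (assumed mechanism),
        mode "add" stores acc[sel] + product,
        mode "sub" stores acc[sel] - product.
        """
        product = x * y
        if mode == "load":
            result = product
        elif mode == "add":
            result = self.acc[sel] + product
        elif mode == "sub":
            result = self.acc[sel] - product
        else:
            raise ValueError("mode must be 'load', 'add' or 'sub'")
        # The result replaces the value that was supplied to the sum input.
        self.acc[sel] = result
        return result


# Forming T_r = X2r*Wr - X2i*Wi for X2 = 1 + 2j and W = 0 - 1j.
mac = MacUnit()
mac.step(1, 0, sel=0, mode="load")   # Acc0 = X2r * Wr = 0
mac.step(2, -1, sel=0, mode="sub")   # Acc0 = 0 - (X2i * Wi) = 2
assert mac.acc == [2, 0]
```

Keeping two accumulator registers lets the real and imaginary temporaries be built up in interleaved cycles without one overwriting the other.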

Register file 110 has two inputs. Register file 110 may be loaded from accumulator 109 or from M register 107. This load is always to a specified register (R₀, R₁, R₂ or R₃). Register file 110 has two outputs. Data from a first specified register (R₀, R₁, R₂ or R₃) may be supplied to a first input of adder/subtracter 111. Data from a second specified register (R₀, R₁, R₂ or R₃) may be supplied to a second input of adder/subtracter 111.

Adder/subtracter 111 has two inputs from register file 110. Adder/subtracter 111 forms a selected one of the sum or the difference of its two inputs. The sum or difference output of adder/subtracter 111 is supplied to write buffer 112.

Write buffer 112 has one input receiving the sum or difference output of adder/subtracter 111. Write buffer 112 includes two buffers B₀ and B₁. The sum or difference output received from adder/subtracter 111 is stored in a selected one of these buffers B₀ and B₁. Write buffer 112 has one output. Write buffer 112 may transfer data from a selected one of buffers B₀ or B₁ to a particular address in X memory 101 or to a particular address in Y memory 102. This transfer stores the results of the butterfly computation back to memory.
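The path from register file 110 through adder/subtracter 111 into write buffer 112 and back to memory can be sketched in the same way. The names add_sub, WriteBuffer, capture and drain are hypothetical, and the memories are modeled as plain Python lists.

```python
def add_sub(regs, a, b, subtract):
    """Model of adder/subtracter 111 reading two registers of the register file.

    Returns regs[a] - regs[b] when subtract is true, otherwise regs[a] + regs[b].
    """
    return regs[a] - regs[b] if subtract else regs[a] + regs[b]


class WriteBuffer:
    """Hypothetical model of write buffer 112 with buffers B0 and B1."""

    def __init__(self):
        self.buf = [0, 0]

    def capture(self, sel, value):
        # Store the adder/subtracter output in the selected buffer.
        self.buf[sel] = value

    def drain(self, sel, memory, address):
        # Write the selected buffer to an address in X or Y memory.
        memory[address] = self.buf[sel]


# R0 = X1r, R1 = X1i, R2 = Tr, R3 = Ti for X1 = 3 + 4j and T = 2 - 1j.
regs = [3, 4, 2, -1]
xmem = [0, 0]
wb = WriteBuffer()
wb.capture(0, add_sub(regs, 0, 2, subtract=True))   # B0 = X1r - Tr = 1
wb.drain(0, xmem, 1)                                 # store X2r' in X memory
assert xmem == [0, 1]
```

Separating capture from drain is what allows a result to wait in B0 or B1 until a memory cycle is free.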

FIG. 2 illustrates table 200 showing how FFT butterfly circuit 100 is controlled to compute the butterfly. Each row of table 200 is one operational cycle of FFT butterfly circuit 100. The first column lists the operational cycle. The second column lists memory operations. The third column lists multiply accumulate operations. The fourth column lists data transfers into register file 110. The fifth column lists operations of adder/subtracter 111.

During cycle 1, X register 105 is loaded from X memory 101 from an address storing a second real term X_(2r). Also during cycle 1, Y register 106 is loaded from Z memory 103 from an address storing a real twiddle factor W_(r).

During cycle 2, X register 105 is loaded from Z memory 103 from an address storing the real twiddle factor W_(r). Y register 106 is loaded from Y memory 102 from an address storing a second imaginary term X_(2i). Multiply accumulator 108 and accumulator 109 store the product of X and Y in Acc₀. Lastly, still during cycle 2, X_(T) register 104 is loaded from X register 105.

During cycle 3, M register 107 is loaded from X memory 101 from an address storing a first real term X_(1r). X register 105 is loaded from Z memory 103 from an address storing an imaginary twiddle factor W_(i). Also during cycle 3, multiply accumulator 108 and accumulator 109 store the product of X and Y in Acc₁.

During cycle 4, M register 107 is loaded from Y memory 102 from an address storing a first imaginary term X_(1i). Y register 106 is loaded from Z memory 103 from an address storing the imaginary twiddle factor W_(i). Multiply accumulator 108 and accumulator 109 store the difference between Acc₀ and the product of X and Y in Acc₀. X register 105 is loaded from X_(T) register 104. Lastly, register R₀ is loaded from M register 107.

During cycle 5, multiply accumulator 108 and accumulator 109 store the sum of Acc₁ and the product of X and Y in Acc₁. Register R₁ is loaded from M register 107. Lastly, register R₂ is loaded from Acc₀.

During cycle 6, register R₃ is loaded from Acc₁. Also during cycle 6, adder/subtracter 111 subtracts R₂ from R₀ and stores the difference in write buffer B₀.

During cycle 7, adder/subtracter 111 subtracts R₃ from R₁ and stores the difference in write buffer B₁.

During cycle 8, the contents of write buffer B₀ are written into X memory 101 at an address of a second real term X_(2r). Also during cycle 8, adder/subtracter 111 adds R₀ to R₂ and stores the sum in write buffer B₀.

During cycle 9, the contents of write buffer B₁ are written into Y memory 102 at an address of a second imaginary term X_(2i). Also during cycle 9, adder/subtracter 111 adds R₁ to R₃ and stores the sum in write buffer B₁.

During cycle 10, the contents of write buffer B₀ are written into X memory 101 at an address of a first real term X_(1r).

During cycle 11, the contents of write buffer B₁ are written into Y memory 102 at an address of a first imaginary term X_(1i).
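The eleven cycles described above can be traced with a short Python sketch. It assumes clocked registers, so that every operation in a cycle reads the values left by the previous cycle and all loads take effect together at the end of the cycle. It also assumes that the real and imaginary parts of a term share one address in X memory and Y memory, and that W_(r) and W_(i) sit at two addresses of Z memory. The function name and the address parameters are hypothetical.

```python
def run_fig2_schedule(xmem, ymem, zmem, a1, a2, a_wr, a_wi):
    """Run one butterfly through the 11-cycle sequence, updating xmem and ymem.

    a1, a2     : addresses of the first and second terms (assumed shared by X and Y)
    a_wr, a_wi : addresses of W_r and W_i in Z memory (assumed layout)
    """
    X = Y = XT = M = 0
    acc = [0, 0]          # Acc0, Acc1
    R = [0, 0, 0, 0]      # R0..R3
    B = [0, 0]            # B0, B1

    # Python evaluates the whole right side of a tuple assignment first,
    # so each line below sees only values from the previous cycle.
    X, Y = xmem[a2], zmem[a_wr]                                   # cycle 1
    acc[0], XT, X, Y = X * Y, X, zmem[a_wr], ymem[a2]             # cycle 2
    acc[1], M, X = X * Y, xmem[a1], zmem[a_wi]                    # cycle 3
    acc[0], R[0], M, Y, X = (acc[0] - X * Y, M, ymem[a1],
                             zmem[a_wi], XT)                      # cycle 4
    acc[1], R[1], R[2] = acc[1] + X * Y, M, acc[0]                # cycle 5
    R[3], B[0] = acc[1], R[0] - R[2]                              # cycle 6
    B[1] = R[1] - R[3]                                            # cycle 7
    xmem[a2], B[0] = B[0], R[0] + R[2]                            # cycle 8
    ymem[a2], B[1] = B[1], R[1] + R[3]                            # cycle 9
    xmem[a1] = B[0]                                               # cycle 10
    ymem[a1] = B[1]                                               # cycle 11
    return xmem, ymem


# X1 = 3 + 4j at address 0, X2 = 1 + 2j at address 1, W = 0 - 1j.
xmem, ymem, zmem = [3, 1], [4, 2], [0, -1]
run_fig2_schedule(xmem, ymem, zmem, a1=0, a2=1, a_wr=0, a_wi=1)
assert (xmem, ymem) == ([5, 1], [3, 5])   # X1' = 5 + 3j, X2' = 1 + 5j
```

Integer inputs are used so that the comparison is exact. In this trace Acc₀ ends cycle 4 holding T_(r) = 2 and Acc₁ ends cycle 5 holding T_(i) = −1, matching the complex product (1 + 2j)(0 − 1j) = 2 − 1j.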

The memory configuration of this invention provides the needed data/coefficient read/write bandwidth without requiring a double width bus to data memory. The register configuration and write buffer of this invention efficiently support 4-cycle/butterfly calculation. The inventors believe this invention provides the highest FFT performance of any known DSP in an area efficient engine. This invention addresses a broad application space.
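One way to see why the 11-cycle sequence sustains one butterfly every 4 cycles is to tabulate the cycles in which each shared resource is busy and check that no two of them coincide modulo 4. The sketch below does this in Python. The cycle lists are taken from the cycle descriptions above. The sketch assumes that each memory and each arithmetic unit performs one operation per cycle, and it does not model register or write buffer hazards. The names FIG2_USAGE and conflicts are hypothetical.

```python
# Cycles of the sequence in which each resource is used by one butterfly.
FIG2_USAGE = {
    "X memory port": [1, 3, 8, 10],        # two reads, two writes
    "Y memory port": [2, 4, 9, 11],        # two reads, two writes
    "Z memory port": [1, 2, 3, 4],         # four twiddle reads
    "multiply accumulator": [2, 3, 4, 5],
    "adder/subtracter": [6, 7, 8, 9],
}


def conflicts(usage, interval):
    """Return resources that would be used twice in the same cycle slot
    when a new butterfly is started every `interval` cycles."""
    clashes = {}
    for resource, cycles in usage.items():
        slots = [c % interval for c in cycles]
        if len(set(slots)) != len(slots):
            clashes[resource] = sorted(slots)
    return clashes


# Starting a butterfly every 4 cycles leaves every resource conflict free.
assert conflicts(FIG2_USAGE, 4) == {}
# A shorter interval oversubscribes the memories and arithmetic units.
assert "X memory port" in conflicts(FIG2_USAGE, 3)
```

Under these assumptions each X and Y memory write lands in a slot that the reads of later butterflies leave free, which is the alignment the write buffer provides.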

What is claimed is:
 1. A FFT butterfly circuit comprising: a first temporary data register having an input and an output; a second temporary data register having a first input connected to said output of said first temporary register, a second input loadable from a specified address of a first memory, a third input loadable from a specified address of a second memory and an output connected to said input of said first temporary data register; a third temporary data register having a first input loadable from a specified address of a third memory, a second input loadable from a specified address of said second memory and an output; a fourth temporary data register having a first input loadable from a specified address of said first memory, a second input loadable from a specified address of said third memory and an output; a multiplier-accumulator having a first product input connected to said output of said second temporary register, a second product input connected to said output of said third temporary register, a sum input and an output, said multiplier-accumulator forming a product of data on said first product input and said second product input and selectable adding to or subtracting from data on said sum input, and a output; an accumulator including two accumulator registers having one input connected to said output of said multiply accumulator for storing data on said input in a specified one of said accumulator registers, a first output supplying data from the non-specified accumulator register and a second output connected to said sum input of said multiplier accumulator supplying data from the specified accumulator register; a register file including a plurality of registers having a first input connected to said first output of said accumulator for storing data at said first input in a first specified one of the plurality of registers, a second input connected to said output of said fourth temporary register for storing data at said second input in a second specified one of the plurality of registers, a first output for supplying data from a third specified one of the plurality of registers and a second output for supplying data from a fourth specified one of the plurality of registers; an adder/subtracter having a first input connected to said first output of said accumulator, a second input connected to said second output of said accumulator and an output supplying a selected one of a sum of data on said first input and on said second input or a difference of said second input from said first input; and a write buffer including two buffers having an input connected to said output of said adder/subtracter for storing data at said input in a specified one of the buffers and an output supplying data from a specified one of the buffers.
 2. The FFT butterfly circuit of claim 1 further comprising: a first memory having an input connected to said output of said write buffer and an output connected to said second input of said second temporary register and to said first input of said fourth temporary register; a second memory having an output connected to said third input of said second temporary register and to said second input of said third temporary register; and a third memory having an input connected to said output of said write buffer and an output connected to said first input of said third temporary register and to said second input of said fourth temporary register.
 3. The FFT butterfly circuit of claim 1, wherein: said first memory and said third memory comprise read/write memory; and said second memory comprises read only memory.
 4. A method of performing a FFT butterfly operation comprising the steps of: during a first operational cycle multiplying a first real input factor by a corresponding real twiddle coefficient, and storing a resulting product in a first accumulator register; during a second operational cycle multiplying a first imaginary input factor by the real twiddle coefficient, and storing a resulting product in a second accumulator register; during a third operational cycle multiplying a first imaginary input factor by a corresponding imaginary twiddle coefficient, subtracting the resulting product from the value stored in the first accumulator register, storing a resulting difference in the first accumulator register, and storing a second real input factor in a first data register; during a fourth operational cycle multiplying the first real input factor by the imaginary twiddle coefficient, adding the resulting product to the value stored in the second accumulator register, storing a resulting sum in the second accumulator register, storing a second imaginary coefficient in a second data register, and storing the value stored in the first accumulator register in a third data register; during a fifth operational cycle storing the value stored in the second accumulator register in a fourth register, subtracting the value stored in the third register from the value stored in the first register, storing the resulting difference in a first write buffer; during a sixth operational cycle subtracting the value stored in the fourth register from the value stored in the second register, and storing the resulting difference in a second write buffer; during a seventh operational cycle adding the value stored in the first register to the value stored in the third register, storing the resulting sum in the first write buffer; and during an eighth operational cycle adding the value stored in the second register to the value stored in the fourth register, and storing the resulting sum in the second write buffer.
 5. The method of claim 4, further comprising the steps of: before the first operational cycle recalling from a first memory the first real factor, storing the first real factor in a first input register, recalling from a second memory the real twiddle coefficient, and storing the real twiddle coefficient in a second input register; during the first operational cycle said step of multiplying multiplies the value stored in the first input register by the value stored in the second input register, storing the value in the first input register in a third input register, recalling from the second memory the real twiddle coefficient, storing the real twiddle coefficient in the first input register, recalling the first imaginary factor from a third memory, and storing the first imaginary factor in the second input register; during the second operational cycle said step of multiplying multiplies the value stored in the first input register by the value stored in the second input register, recalling the second real factor from the first memory, storing the second real factor in a fourth input register, recalling an imaginary twiddle coefficient from the second memory, and storing the imaginary twiddle coefficient in the first input register; during the third operational cycle said step of multiplying multiplies the value stored in the first input register by the value stored in the second input register, recalling the second real factor from the third memory, storing the second real factor in the fourth input register, storing the value in the third input register in the first input register, recalling the imaginary twiddle coefficient from the second memory, and storing the imaginary twiddle coefficient in the second input register; during the seventh operational cycle storing the value stored in the first write buffer during the fifth operational cycle into the first memory; during the eighth operational cycle storing the value stored in the second write buffer during the sixth operational cycle into the second memory; during a ninth operational cycle storing the value stored in the first write buffer during the seventh operational cycle into the second memory; and during a tenth operational cycle storing the value stored in the second write buffer during the eighth operational cycle into the second memory.
 6. The method of claim 5, wherein: said first memory and said third memory comprise read/write memory; and said second memory comprises read only memory. 